Abstract: This work presents an efficient hardware accelerator for the
computation of the Modified Discrete Sine Transform (MDST). It uses a new VLSI
algorithm based on an appropriate reformulation of the MDST algorithm with
auxiliary input and output sequences. The resulting low-complexity
implementation uses only adders/subtracters and has a reduced critical path that
can be exploited to obtain a significant reduction of the power consumption.

1. Introduction


The Modified Discrete Cosine Transform (MDCT), the Modified Discrete
Sine Transform (MDST), and their inverse transforms (IMDCT and IMDST) are
used in subband analysis/synthesis schemes [1],[2]. These schemes underlie the
filter banks of the Dolby Enhanced AC-3 (E-AC-3) audio coding standard [3] and
of several other audio coding standards [4]-[6].

Like the DCT and DST, the MDCT and MDST are computationally intensive, so
efficient software and hardware algorithms and implementations are required for
real-time operation.

There are several efficient software implementations [7]-[10] and some
hardware implementations [11]-[20], but all of these hardware solutions are based
on recursive algorithms.

It is possible to establish a quite simple relation between the MDCT and the
MDST, which allows research to concentrate on fast MDCT algorithms and
implementations. However, computing the MDST through an MDCT algorithm still
takes extra execution time, even though the MDCT and MDST hardware accelerators
could be unified. As a result, the MDCT and MDST coefficients cannot be computed
simultaneously. Since these equations are all interdependent, how to efficiently
compute the MDCT and MDST coefficients is still a challenging problem.

In this paper we propose a direct method to efficiently implement the
MDST algorithm using some auxiliary input and output sequences. Thus, we
obtain an efficient VLSI implementation of a hardware accelerator that uses only
adders/subtracters and has a reduced critical path, which allows a high-speed
implementation and, at the same time, a reduced power consumption.

The rest of the paper is organized as follows:

Section 2 presents the new VLSI algorithm for the MDST, which allows an
efficient hardware implementation of a hardware accelerator that works
together with a host computational structure.

Section 3 presents the VLSI implementation of the proposed algorithm, and
Section 4 draws some conclusions.


2. A New VLSI Algorithm for Modified DST 


The 1-D MDST for a real input sequence x(i), i = 0, 1, ..., N-1, is defined as:

$$Y(k) = \sum_{i=0}^{N-1} x(i) \cdot \sin[(2i+1+N/2)(2k+1)\alpha] \qquad (1)$$

for k = 0, ..., M-1, where M = N/2 and

$$\alpha = \frac{\pi}{2N} \qquad (2)$$
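As a reference point for what follows, equations (1) and (2) can be evaluated directly. The sketch below is a plain software model, not the proposed hardware algorithm; the function name `mdst_direct` is a hypothetical label, and the summation range i = 0, ..., N-1 is assumed from the definition of the input sequence.

```python
import math


def mdst_direct(x):
    """Evaluate equation (1) term by term: N real inputs, M = N/2 outputs."""
    n = len(x)
    alpha = math.pi / (2 * n)  # equation (2)
    return [
        sum(x[i] * math.sin((2 * i + 1 + n / 2) * (2 * k + 1) * alpha)
            for i in range(n))
        for k in range(n // 2)
    ]


# Example: a unit impulse at i = 0 with N = 8 gives M = 4 output samples.
impulse_response = mdst_direct([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print([round(v, 4) for v in impulse_response])
```

This direct form needs N multiplications per output sample, which is the cost the restructured algorithm aims to avoid.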
We introduce some auxiliary input sequences, defined below:


$$x_s(i) = x(i) \cdot \sin[(2i+1+N/2)\alpha] \qquad (3)$$
$$x_c(i) = x(i) \cdot \cos[(2i+1+N/2)\alpha] \qquad (4)$$
$$x_a(i) = x_c(i) + x_c(N-1-i) \qquad (5)$$
$$x_b(i) = x_c(i) - x_c(N-1-i) \qquad (6)$$


Using these auxiliary input sequences, it is possible to obtain an efficient
implementation of a hardware accelerator for the computation of the MDST.
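The four auxiliary sequences of equations (3)-(6) can be sketched in software as follows. The names `x_s`, `x_c`, `x_a` and `x_b` are labels assumed here for the sine-weighted, cosine-weighted, sum and difference sequences. Restricting the sum and difference sequences to i = 0, ..., N/2-1 is also an assumption, based on the indices used in the N=8 example.

```python
import math


def aux_inputs(x):
    """Return (x_s, x_c, x_a, x_b) for equations (3)-(6)."""
    n = len(x)
    alpha = math.pi / (2 * n)
    theta = [(2 * i + 1 + n / 2) * alpha for i in range(n)]
    x_s = [x[i] * math.sin(theta[i]) for i in range(n)]       # equation (3)
    x_c = [x[i] * math.cos(theta[i]) for i in range(n)]       # equation (4)
    # Sum and difference of mirrored cosine-weighted samples.
    x_a = [x_c[i] + x_c[n - 1 - i] for i in range(n // 2)]    # equation (5)
    x_b = [x_c[i] - x_c[n - 1 - i] for i in range(n // 2)]    # equation (6)
    return x_s, x_c, x_a, x_b


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x_s, x_c, x_a, x_b = aux_inputs(samples)
```

In this model the multiplications by the sine and cosine factors are assumed to be done outside the accelerator core, for instance by the host structure, so that the core only has to combine the pre-weighted samples.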

Thus, when the length of the transform is a power of 2 it is possible to 
significantly reduce the hardware complexity of the designed accelerator. For 
example, when N=8 we have the following equations for our hardware 
accelerator: 


$$T(0) = 0 \qquad (7)$$
$$T(2) = -2\cos(\pi/4)\,[(x_b(0) - x_b(3)) + (x_b(1) - x_b(2))] \qquad (8)$$

and

$$T(1) = 2\,[\cos(\pi/8)\,(x_a(0) - x_a(3)) + \cos(3\pi/8)\,(x_a(1) - x_a(2))] \qquad (9)$$
$$T(3) = 2\,[\cos(\pi/8)\,(x_a(1) - x_a(2)) - \cos(3\pi/8)\,(x_a(0) - x_a(3))] \qquad (10)$$

Using a subexpression sharing technique, we can further reduce the number of
adders/subtracters. To this end, we introduce the following notation:


$$x_b(0,3) = x_b(0) - x_b(3) \qquad (11)$$
$$x_b(1,2) = x_b(1) - x_b(2) \qquad (12)$$

and

$$x_a(0,3) = x_a(0) - x_a(3) \qquad (13)$$
$$x_a(1,2) = x_a(1) - x_a(2) \qquad (14)$$


Thus, equations (7)-(10) can be rewritten as:


$$T(0) = 0 \qquad (15)$$
$$T(2) = -2\cos(\pi/4)\,[x_b(0,3) + x_b(1,2)] \qquad (16)$$

and

$$T(1) = 2\,[\cos(\pi/8)\,x_a(0,3) + \cos(3\pi/8)\,x_a(1,2)] \qquad (17)$$
$$T(3) = 2\,[\cos(\pi/8)\,x_a(1,2) - \cos(3\pi/8)\,x_a(0,3)] \qquad (18)$$
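The N=8 equations with shared subexpressions can be checked numerically. The sketch below compares them against a general expression, T(k) = 2 * sum of x_c(i) * sin(2k(2i+1+N/2)alpha). That expression is not stated in the text: it is an assumption derived here from the recursion Y(k) = T(k) - Y(k-1) and the MDST definition. The names `t_shared` and `t_reference` are hypothetical, and `x_c`, `x_a`, `x_b` denote the cosine-weighted, sum and difference sequences.

```python
import math

N = 8
ALPHA = math.pi / (2 * N)
C1 = math.cos(math.pi / 4)
C2 = math.cos(math.pi / 8)
C3 = math.cos(3 * math.pi / 8)


def cos_weighted(x):
    """Cosine-weighted input sequence x_c(i)."""
    return [x[i] * math.cos((2 * i + 1 + N / 2) * ALPHA) for i in range(N)]


def t_shared(x):
    """T(0..3) for N = 8 using the four shared differences."""
    x_c = cos_weighted(x)
    x_a = [x_c[i] + x_c[N - 1 - i] for i in range(N // 2)]
    x_b = [x_c[i] - x_c[N - 1 - i] for i in range(N // 2)]
    a03, a12 = x_a[0] - x_a[3], x_a[1] - x_a[2]
    b03, b12 = x_b[0] - x_b[3], x_b[1] - x_b[2]
    return [
        0.0,                          # T(0)
        2 * (C2 * a03 + C3 * a12),    # T(1)
        -2 * C1 * (b03 + b12),        # T(2)
        2 * (C2 * a12 - C3 * a03),    # T(3)
    ]


def t_reference(x):
    """Assumed general form of T(k), evaluated without any restructuring."""
    x_c = cos_weighted(x)
    return [
        2 * sum(x_c[i] * math.sin(2 * k * (2 * i + 1 + N / 2) * ALPHA)
                for i in range(N))
        for k in range(N // 2)
    ]


x = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.0, -0.9]
max_diff = max(abs(p - q) for p, q in zip(t_shared(x), t_reference(x)))
print(max_diff)
```

The shared form needs only three distinct constants and four differences, compared with eight sine evaluations per output in the reference form.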


Finally, the output sequence can be recursively computed using equations (19) 
and (20) as follows: 


$$Y(0) = \sum_{i=0}^{N-1} x_s(i) \qquad (19)$$
$$Y(k) = T(k) - Y(k-1) \qquad (20)$$

for k = 1, ..., M-1.
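A short derivation shows why such a recursion holds. As a sketch, assume that the auxiliary output sequence is T(k) = Y(k) + Y(k-1), and write theta_i = (2i+1+N/2)alpha, a shorthand introduced here. Applying the identity sin A + sin B = 2 sin((A+B)/2) cos((A-B)/2) to the MDST definition gives:

```latex
\begin{aligned}
T(k) &= Y(k) + Y(k-1) \\
     &= \sum_{i=0}^{N-1} x(i)\,\bigl[\sin((2k+1)\theta_i) + \sin((2k-1)\theta_i)\bigr] \\
     &= 2 \sum_{i=0}^{N-1} x(i)\cos(\theta_i)\,\sin(2k\theta_i),
\qquad \theta_i = (2i+1+N/2)\,\alpha .
\end{aligned}
```

Under this assumption T(k) depends on the input only through the cosine-weighted samples x(i)cos(theta_i), T(0) vanishes because sin(0) = 0, and setting k = 0 in the definition gives the starting value Y(0) as the sum of the sine-weighted samples x(i)sin(theta_i).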


Thus, using the auxiliary output sequence {T(k): k = 1, ..., M-1}, we can
recursively compute the final output sequence using an accumulator structure.
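The complete flow can be checked end to end against the direct definition. In this minimal sketch the general form of T(k) is again an assumption derived from the recursion, not a formula given in the text, and `mdst_direct` and `mdst_recursive` are hypothetical names.

```python
import math


def mdst_direct(x):
    """Direct evaluation of the MDST definition."""
    n = len(x)
    alpha = math.pi / (2 * n)
    return [
        sum(x[i] * math.sin((2 * i + 1 + n / 2) * (2 * k + 1) * alpha)
            for i in range(n))
        for k in range(n // 2)
    ]


def mdst_recursive(x):
    """Compute Y(0) from the sine-weighted inputs, then accumulate T(k)."""
    n = len(x)
    alpha = math.pi / (2 * n)
    theta = [(2 * i + 1 + n / 2) * alpha for i in range(n)]
    x_s = [x[i] * math.sin(theta[i]) for i in range(n)]
    x_c = [x[i] * math.cos(theta[i]) for i in range(n)]
    y = [sum(x_s)]                      # starting value Y(0)
    for k in range(1, n // 2):
        t_k = 2 * sum(x_c[i] * math.sin(2 * k * theta[i]) for i in range(n))
        y.append(t_k - y[-1])           # Y(k) = T(k) - Y(k-1)
    return y


x = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.0, -0.9]
max_error = max(abs(p - q) for p, q in zip(mdst_direct(x), mdst_recursive(x)))
print(max_error)
```

The accumulator step costs one subtraction per output sample, so the arithmetic effort is concentrated in the computation of T(k).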

3. A VLSI Implementation of MDST Using the Proposed VLSI Algorithm


Equations (17) and (18), which compute the odd samples T(1) and T(3) of the
auxiliary output sequence T(k), can be implemented using the hardware
structure from Figure 1. T(1) and T(3) require 4 multiplications with the
constants cos(pi/N) and cos(3pi/N). Because each constant is used in 2 of these
4 multiplications, the number of multipliers can be further reduced to 2.
Moreover, multiplication by a constant can be efficiently implemented using only
adders/subtracters and shift operations, as will be shown below.




Figure 1. The architecture that implements equations (17) and (18) 


The function of the processing elements from Figure 1 is presented in
Figure 2. The partial result y is added to the product of the input x and the
multiplier constant c. The input x is also forwarded to the next processing
element as x'.

Figure 2. The function of the processing elements from Fig. 1
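The behaviour of one processing element, as described in the text, can be modeled in a few lines. The way the elements are combined below is only an assumption: each shared difference passes through two elements, one feeding T(1) and one feeding T(3), with the sign of the cos(3pi/8) term for T(3) folded into the constant. The figures are not reproduced here, so the actual interconnection may differ. The names `processing_element` and `odd_outputs` are hypothetical.

```python
import math

C2 = math.cos(math.pi / 8)
C3 = math.cos(3 * math.pi / 8)


def processing_element(x, y, c):
    """Return (x', y'): x is forwarded unchanged, y accumulates c * x."""
    return x, y + c * x


def odd_outputs(a03, a12):
    """Compute T(1) and T(3) from the two shared differences."""
    # First input: contributes +C2 to T(1) and -C3 to T(3).
    x_fwd, t1 = processing_element(a03, 0.0, C2)
    _, t3 = processing_element(x_fwd, 0.0, -C3)
    # Second input: contributes +C3 to T(1) and +C2 to T(3).
    x_fwd, t1 = processing_element(a12, t1, C3)
    _, t3 = processing_element(x_fwd, t3, C2)
    return 2 * t1, 2 * t3


t1, t3 = odd_outputs(1.0, 0.5)
print(t1, t3)
```

The final scaling by 2 is a left shift by one bit in fixed-point hardware and therefore costs no adder.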


The even samples of the auxiliary output sequence T(k) are
computed using the following hardware structure, which implements equation
(16):



Figure 3. The hardware implementation of equation (16) 


It can be seen from Figure 3 that computing the even part of the
auxiliary output sequence requires only one constant multiplier and one
adder. As already mentioned, multiplication by a constant can be
implemented using adders/subtracters and shift operations.

Table 1. Signed Digit representations of the constants used in constant multipliers

Coefficient (c)    | Representation                     | No. of adders/subtractors
c(1)=cos(pi/4)     | 2^0 - 2^-2 - 2^-4 + 2^-6 + 2^-8    | 4
c(2)=cos(pi/8)     | 2^0 - 2^-4 - 2^-6 + 2^-9           | 3
c(3)=cos(3pi/8)    | 2^-1 - 2^-3 + 2^-7 - 2^-(...)      | 3


As can be seen from Table 1, the constant multipliers from Figure 1 and
Figure 3 can be implemented efficiently using a Signed Digit (SD)
representation of the constants cos(pi/4), cos(pi/8) and cos(3pi/8). Each
multiplier needs only 3 or 4 adders/subtracters plus shift operations. The shift
operations can be implemented with appropriate interconnections, without any
additional hardware circuits.
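The idea of a multiplierless constant multiplication can be sketched in software. The example below derives a canonical signed-digit form of each constant and applies it with shifts, additions and subtractions only. The 9-bit fractional word length and the truncation of the constants are assumptions chosen for illustration, and the names `csd_terms` and `shift_add_multiply` are hypothetical. The digit patterns actually used in the design may differ.

```python
import math


def csd_terms(value, frac_bits):
    """Canonical signed-digit terms of a constant in [0, 1].

    Returns a list of (sign, shift) pairs meaning sign * 2**(-shift),
    after truncating the constant to frac_bits fractional bits.
    """
    n = int(value * (1 << frac_bits))
    terms = []
    pos = 0  # bit position of the digit currently examined
    while n != 0:
        if n & 1:
            digit = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            terms.append((digit, frac_bits - pos))
            n -= digit
        n >>= 1
        pos += 1
    return terms


def shift_add_multiply(x, terms):
    """Multiply integer x by the constant using only shifts and adds/subtracts."""
    return sum(sign * (x >> shift) for sign, shift in terms)


FRAC_BITS = 9  # assumed word length
for name, c in [("cos(pi/4)", math.cos(math.pi / 4)),
                ("cos(pi/8)", math.cos(math.pi / 8)),
                ("cos(3pi/8)", math.cos(3 * math.pi / 8))]:
    terms = csd_terms(c, FRAC_BITS)
    # A sum of k signed terms needs k - 1 adders/subtracters.
    print(name, terms, "adders/subtracters:", len(terms) - 1)
```

With this assumed word length the sketch needs 4, 3 and 3 adders/subtracters for the three constants, which agrees with the count of 3 or 4 stated above. In hardware each shift is a fixed wiring offset, so only the additions and subtractions consume logic.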

The implementations of the 3 multipliers are presented in Figure 4, Figure 
5 and Figure 6. 


As can be seen from Figure 4, the multiplication by the constant
c(1)=cos(pi/4) uses 2 subtractors and 2 adders. The shift operations do not
involve any supplementary circuits, only appropriate interconnections. To reduce
the critical path of the circuit we use pipelining: the pipeline registers are
placed where the cut-set lines (represented with dotted lines) in Figure 4
intersect the communication links. Thus, the critical path is reduced to Ta,
where Ta is the delay of an adder/subtracter.

Figure 4. The implementation of the multiplier with the constant c(1)=cos(pi/4)


As can be seen from Figure 5, the multiplication by the constant
c(2)=cos(pi/8) uses 2 adders and one subtractor. As shown before, the shift
operations do not involve any supplementary circuits. The critical path is again
reduced to Ta using pipelining: the pipeline registers are placed where the
cut-set lines in Figure 5 intersect the communication links.



Figure 5. The implementation of the constant multiplier with the constant c(2)=cos(pi/8) 


Figure 6. The implementation of the constant multiplier with constant c(3)=cos(3pi/8) 


As can be seen from Figure 6, the multiplication by the constant
c(3)=cos(3pi/8) uses 2 adders and one subtractor. The shift operations do not
involve any supplementary circuits, and the critical path is reduced to Ta using
pipelining: the pipeline registers are placed where the cut-set lines in
Figure 6 intersect the communication links.


Thus, to implement the hardware accelerator we use only 3 multipliers,
which are implemented using only 11 adders and 4 subtracters, and the
critical path has been reduced to Ta. This short critical path can be used to
reduce the power consumption: the supply voltage can be lowered, which reduces
the power consumption at the cost of an increased delay.
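As a first-order illustration of this trade-off, the dynamic power of a CMOS circuit is commonly modeled as shown below. This is a standard textbook approximation and not a result of this work. The symbols are assumptions introduced here: a is the switching activity, C_L the switched capacitance, V_DD the supply voltage and f the clock frequency.

```latex
P_{dyn} \approx a \cdot C_L \cdot V_{DD}^{2} \cdot f
```

Under this model, if the shortened critical path leaves timing slack at the target clock frequency, V_DD can be lowered until the slack is used up, and the dynamic power falls roughly with the square of V_DD.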


4. Conclusions 


An efficient hardware implementation of the Modified Discrete Sine Transform
(MDST) with reduced hardware complexity and high-speed performance has been
presented. It uses an appropriate reformulation of the MDST algorithm based on
auxiliary input and output sequences. The resulting hardware accelerator is a
low-complexity implementation that uses only 11 adders and 4 subtracters and has
a reduced critical path, which can be used to obtain a further reduction of the
power consumption.